Introduction

Welcome to our brief exploration of machine learning topics through one of the most widely used and readily available datasets: the stock market!

The Plan

The plan is to:

  1. Explore a single company's stock data on its own via visuals, a selection of metrics, and some cursory statistical analysis.
  2. Explore connections between the stock trends of similar corporations in the tech industry.
  3. Consider a question I have: is forecasting truly improved, and by how much, if we use similar companies' past data to try to predict a target company's future stock prices?

Terminology and Tech Stack

  1. ARIMA models:

    • ARIMA, which stands for AutoRegressive Integrated Moving Average, is a popular class of time series forecasting models that explain a given time series based on its own past values, that is, its own lags and the lagged forecast errors. It combines the ideas of autoregression (AR) and moving averages (MA), with differencing (the "Integrated" part) applied first when the series is not stationary. It aims to describe the autocorrelations in the data.
  2. adfuller from statsmodels.tsa.stattools:

    • The Augmented Dickey-Fuller test (adfuller) is a type of statistical test called a unit root test. Its null hypothesis is that the time series sample has a unit root, so a small p-value is evidence that the series is stationary and a large one means non-stationarity cannot be ruled out.
  3. Autocorrelation:

    • Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of the delay. In time series data, it's used to determine whether a series is random or has underlying patterns.
  4. plot_acf and plot_pacf from statsmodels.graphics.tsaplots:

    • plot_acf and plot_pacf are functions from the statsmodels library used to plot the autocorrelation function (ACF) and partial autocorrelation function (PACF) of a time series, respectively. These plots are useful for determining the order of AR and MA terms in an ARIMA model.
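
To make the definition of autocorrelation above concrete, here is a minimal sketch that computes it by hand as the Pearson correlation between a series and a shifted copy of itself. The helper name lag_autocorrelation and the zigzag data are made up for illustration, and this simple version is not necessarily the exact estimator that statsmodels uses in its plots.

```python
import numpy as np
import pandas as pd

def lag_autocorrelation(values, lag):
    """Pearson correlation between a series and a copy of itself shifted by `lag` steps."""
    x = np.asarray(values, dtype=float)
    if lag <= 0 or lag >= len(x):
        raise ValueError("lag must be between 1 and len(values) - 1")
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])

# Illustrative data (an assumption, not real prices): a strict up/down alternation
zigzag = [1.0, -1.0] * 10

lag1 = lag_autocorrelation(zigzag, 1)   # neighbours always disagree
lag2 = lag_autocorrelation(zigzag, 2)   # every second value repeats
print(lag1, lag2)

# pandas offers the same pairwise calculation as Series.autocorr
print(pd.Series(zigzag).autocorr(lag=1))
```

A value near -1 at lag 1 and near +1 at lag 2 is what we would expect for a series that flips sign at every step, which is the kind of "underlying pattern" autocorrelation is meant to reveal.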
#!pip install yfinance

Apples, Oranges and Data Exploration

Now that we have downloaded the necessary packages, let's start investigating some data for specific companies.

The Apple

Since I am quite the Apple fan, I am going to start off by checking out how Apple has been doing over the years.

We are going to import the yfinance package, look up Apple's ticker symbol (the symbol its entries are listed under on Yahoo Finance), and extract a "sub dataframe" with the stock prices using that symbol.

import yfinance as yf

# Define the ticker symbol
tickerSymbol = 'AAPL'

# Get data on this ticker
tickerData = yf.Ticker(tickerSymbol)

# Get the historical prices for this ticker
tickerDf = tickerData.history(period='5y')

# See the data
tickerDf
Date                       Open        High        Low         Close       Volume     Dividends  Stock Splits
2018-08-06 00:00:00-04:00  49.695115   49.993764   49.472922   49.950760   101701600  0.0000     0.0
2018-08-07 00:00:00-04:00  50.010491   50.053495   49.398856   49.482479   102349600  0.0000     0.0
2018-08-08 00:00:00-04:00  49.229228   49.649724   48.863683   49.515930   90102000   0.0000     0.0
2018-08-09 00:00:00-04:00  50.060665   50.120394   49.503983   49.905369   93970400   0.0000     0.0
2018-08-10 00:00:00-04:00  49.715952   50.133130   49.550519   49.756710   98444800   0.1825     0.0
...                        ...         ...         ...         ...         ...        ...        ...
2023-07-31 00:00:00-04:00  196.059998  196.490005  195.259995  196.449997  38824100   0.0000     0.0
2023-08-01 00:00:00-04:00  196.240005  196.729996  195.279999  195.610001  35175100   0.0000     0.0
2023-08-02 00:00:00-04:00  195.039993  195.179993  191.850006  192.580002  50389300   0.0000     0.0
2023-08-03 00:00:00-04:00  191.570007  192.369995  190.690002  191.169998  61235200   0.0000     0.0
2023-08-04 00:00:00-04:00  185.520004  187.380005  181.919998  181.990005  115799700  0.0000     0.0

1258 rows × 7 columns

If you scan across large intervals of years, you will pretty quickly come to the conclusion that Apple's stock price has been climbing. What is not as easy to tell from the raw table is how big the ups and downs in the stock price have been. It would be easier to get a feel for these qualities if we had even a simple plot, so let's make one right now. We set up a figure size reasonable for our notebook and add labels (all using matplotlib) to get a modest but useful visual.

Note that I have chosen to plot the closing price against the date. I would likely get similar findings with opening prices, since the gap between the open and the close on a given day is small compared with how much either metric moves over long stretches of time. It may be of interest to the reader to explore this topic on its own!

import matplotlib.pyplot as plt

# Plot the closing price
plt.figure(figsize=(10, 6))
plt.plot(tickerDf['Close'])
plt.title('Closing price of AAPL')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.grid(True)
plt.show()
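
For readers who want to explore the open-versus-close question raised above, here is a minimal sketch of one way to quantify it. The helper open_close_gap_summary is a hypothetical name, and the data frame below is synthetic (an assumption made so the example runs without a download); in the notebook you would pass tickerDf instead.

```python
import numpy as np
import pandas as pd

def open_close_gap_summary(df):
    """Compare the typical same-day open/close gap with the spread of closing prices."""
    gap = (df["Close"] - df["Open"]).abs()
    return {
        "mean_abs_gap": float(gap.mean()),
        "close_std": float(df["Close"].std()),
        "open_close_corr": float(df["Open"].corr(df["Close"])),
    }

# Synthetic stand-in for tickerDf (an assumption): a drifting price with small intraday moves
rng = np.random.default_rng(0)
close = 100 + np.cumsum(rng.normal(0.1, 1.0, size=500))
open_ = close + rng.normal(0.0, 0.5, size=500)
demo = pd.DataFrame({"Open": open_, "Close": close})

summary = open_close_gap_summary(demo)
print(summary)
```

If the mean absolute gap is much smaller than the standard deviation of the closing price and the two columns are highly correlated, a plot of opening prices should look very similar to the plot of closing prices.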
#!pip install statsmodels
from statsmodels.tsa.stattools import adfuller

# Perform ADF test on the 'Close' prices
result = adfuller(tickerDf['Close'])

print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
ADF Statistic: -0.647774
p-value: 0.859818
Critical Values:
	1%: -3.436
	5%: -2.864
	10%: -2.568

The ADF statistic (-0.648) sits above every critical value and the p-value (0.86) is far above 0.05, so we cannot reject the null hypothesis of a unit root: the raw closing prices look non-stationary. A common remedy is to take the first difference of the series and test again.

# Take the first difference of the closing prices
tickerDf['Close_diff'] = tickerDf['Close'].diff()

# Drop the missing values that were created by taking the difference
tickerDf = tickerDf.dropna()

# Perform ADF test on the differenced data
result = adfuller(tickerDf['Close_diff'])

print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
ADF Statistic: -37.111304
p-value: 0.000000
Critical Values:
	1%: -3.436
	5%: -2.864
	10%: -2.568

Correlation Discussion Followed By Our Analyses

When you use plot_acf and plot_pacf from statsmodels.graphics.tsaplots, the resulting plots provide a visual representation of the autocorrelation and partial autocorrelation of a time series, respectively. Here's how you can interpret these plots:

  1. Plotting the Autocorrelation Function (ACF) with plot_acf:

    • Stationarity: If the bars (correlogram) in the ACF plot drop off quickly, the series might be stationary. If they remain significant over several lags, the series is likely non-stationary.

    • Seasonality: Regular patterns of peaks and troughs at consistent intervals can indicate seasonality. For instance, a peak every 12 lags in a monthly series suggests annual seasonality.

    • Randomness: If most bars are within the blue shaded region (confidence intervals) and are close to zero, the series might be random.

    • Model Identification: If the ACF cuts off after a certain number of lags, it suggests a possible MA order for an ARIMA model. If it declines gradually, an AR model might be more appropriate.

  2. Plotting the Partial Autocorrelation Function (PACF) with plot_pacf:

    • Model Identification: The PACF plot can help identify the order of the AR term. If the PACF drops to zero after a certain number of lags, it suggests that an AR model of that order might be suitable. For instance, if the PACF is significant for 2 lags and then drops off, an AR(2) model might be a good fit.

    • Lagged Relationships: Like the ACF, significant spikes in the PACF at specific lags indicate a relationship between the data point and its lagged values, but after controlling for other lags.

Here's a simple example using Python to visualize the ACF and PACF, with random data standing in for a real series:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Generate some example data (e.g., from a DataFrame)
# data = pd.read_csv('your_data.csv')
# ts = data['your_column']

# For demonstration purposes, let's use random data
ts = np.random.randn(100)

# Plot ACF
plot_acf(ts, lags=40)
plt.title('ACF Plot')
plt.show()

# Plot PACF
plot_pacf(ts, lags=40)
plt.title('PACF Plot')
plt.show()

When interpreting these plots, it's essential to consider the blue shaded region, which represents the confidence intervals. Bars that extend beyond this region are considered statistically significant.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot the ACF
plot_acf(tickerDf['Close_diff'], lags=50)
plt.show()

# Plot the PACF
plot_pacf(tickerDf['Close_diff'], lags=50)
plt.show()

ARIMA

We are now going to fit an ARIMA(1,1,1) model to Apple's closing prices and look at its summary.

from statsmodels.tsa.arima.model import ARIMA

# Fit an ARIMA(1,1,1) model
model = ARIMA(tickerDf['Close'], order=(1,1,1))
model_fit = model.fit()

# Print out the summary of the model
print(model_fit.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                  Close   No. Observations:                 1257
Model:                 ARIMA(1, 1, 1)   Log Likelihood               -2834.265
Date:                Sun, 06 Aug 2023   AIC                           5674.530
Time:                        14:51:30   BIC                           5689.937
Sample:                             0   HQIC                          5680.320
                               - 1257                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.4184      0.288      1.453      0.146      -0.146       0.983
ma.L1         -0.4739      0.282     -1.683      0.092      -1.026       0.078
sigma2         5.3400      0.138     38.709      0.000       5.070       5.610
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):               455.93
Prob(Q):                              0.99   Prob(JB):                         0.00
Heteroskedasticity (H):               5.07   Skew:                            -0.08
Prob(H) (two-sided):                  0.00   Kurtosis:                         5.95
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
# Fetch the historical stock data for Microsoft (MSFT)
msft = yf.Ticker('MSFT')

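# Get daily historical prices for this ticker over a fixed date range
# (interval sets the bar size; period is not needed when start and end are given)
msftDf = msft.history(interval='1d', start='2020-01-01', end='2023-01-01')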

# Print the first few rows of the data
msftDf.head()
Date                       Open        High        Low         Close       Volume    Dividends  Stock Splits
2020-01-02 00:00:00-05:00  153.641607  155.528499  153.206173  155.422058  22622100  0.0        0.0
2020-01-03 00:00:00-05:00  153.196506  154.773747  152.944911  153.486786  21116200  0.0        0.0
2020-01-06 00:00:00-05:00  151.996623  153.951256  151.445062  153.883514  20813700  0.0        0.0
2020-01-07 00:00:00-05:00  154.164134  154.502799  152.228858  152.480438  21634100  0.0        0.0
2020-01-08 00:00:00-05:00  153.786716  155.596209  152.838435  154.909180  27746500  0.0        0.0

The Apple and the Orange Meet

# Import the necessary library
import matplotlib.pyplot as plt

# Plot the closing prices for Apple and Microsoft
plt.figure(figsize=(14, 7))
plt.plot(tickerDf['Close'], label='Apple')
plt.plot(msftDf['Close'], label='Microsoft')
plt.title('Apple vs Microsoft - Closing Prices')
plt.xlabel('Date')
plt.ylabel('Closing Price (USD)')
plt.legend()
plt.show()
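
A plot of raw prices shows whether two stocks drift in the same direction, but a number is handy when exploring the connections mentioned in the plan. A minimal sketch, assuming we measure the connection as the correlation of daily percentage returns on the dates both series share: the helper return_correlation is a hypothetical name, and the two price paths are synthetic stand-ins for tickerDf['Close'] and msftDf['Close'].

```python
import numpy as np
import pandas as pd

def return_correlation(close_a, close_b):
    """Correlation of daily percentage returns on the dates both series share."""
    returns = pd.concat(
        [close_a.pct_change(), close_b.pct_change()], axis=1, join="inner"
    ).dropna()
    return float(returns.iloc[:, 0].corr(returns.iloc[:, 1]))

# Synthetic stand-ins (an assumption): two price paths driven by a shared shock
rng = np.random.default_rng(3)
dates = pd.bdate_range("2020-01-01", periods=250)
shared = rng.normal(0, 0.01, size=250)
a = pd.Series(100 * np.cumprod(1 + shared + rng.normal(0, 0.005, 250)), index=dates)
b = pd.Series(150 * np.cumprod(1 + shared + rng.normal(0, 0.005, 250)), index=dates)

corr = return_correlation(a, b)
print(corr)
```

Returns are used instead of prices because two upward-trending price series can look strongly correlated even when their day-to-day moves are unrelated.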

Future Directions

This section will continue to develop so please revisit and contact me if you have any ideas or would like me to dive more deeply into a question you have. Thank you for visiting!

Other Large and Silently Growing Companies

# Fetch the historical stock data for The Home Depot (HD)
hd = yf.Ticker('HD')

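# Get daily historical prices for this ticker over a fixed date range
# (interval sets the bar size; period is not needed when start and end are given)
hdDf = hd.history(interval='1d', start='2020-01-01', end='2023-01-01')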

# Print the first few rows of the data
hdDf.head()
Date                       Open        High        Low         Close       Volume   Dividends  Stock Splits
2020-01-02 00:00:00-05:00  201.584009  202.209696  200.443032  202.117691  3935700  0.0        0.0
2020-01-03 00:00:00-05:00  199.798942  202.136088  199.440088  201.445984  3423200  0.0        0.0
2020-01-06 00:00:00-05:00  199.200855  202.430537  199.118032  202.393738  5682800  0.0        0.0
2020-01-07 00:00:00-05:00  201.970488  202.945833  199.578122  201.068756  5685400  0.0        0.0
2020-01-08 00:00:00-05:00  201.326415  205.163393  201.206792  204.077621  4916200  0.0        0.0